White Wine Data Exploration by Shahab Sheikh-Bahaei

Introduction

In this project I explore a data set of white wines. The data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent) [1].

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Univariate Plots Section

## Using  as id variables
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$fixed.acidity,data=d,geom="freqpoly", binwidth=0.5 )#,color=as.factor(quality))

qplot(d$fixed.acidity,data=d,geom="freqpoly", binwidth=0.1)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$fixed.acidity),data=d)+geom_boxplot()

summary(d$fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
qplot(d$volatile.acidity,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$volatile.acidity,data=d,geom="freqpoly",binwidth=0.01)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$volatile.acidity),data=d)+geom_boxplot()

summary(d$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
qplot(d$citric.acid,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$citric.acid,data=d,geom="freqpoly",binwidth=0.01)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$citric.acid),data=d)+geom_boxplot()

summary(d$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
qplot(d$residual.sugar,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$residual.sugar,data=d,geom="freqpoly",binwidth=0.1)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$residual.sugar),data=d)+geom_boxplot()

summary(d$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
qplot(d$chlorides,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$chlorides,data=d,geom="freqpoly",binwidth=0.001)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$chlorides),data=d)+geom_boxplot()

summary(d$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
qplot(d$free.sulfur.dioxide,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$free.sulfur.dioxide,data=d,geom="freqpoly",binwidth=1)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$free.sulfur.dioxide),data=d)+geom_boxplot()

summary(d$free.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
qplot(d$total.sulfur.dioxide,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$total.sulfur.dioxide,data=d,geom="freqpoly",binwidth=1)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$total.sulfur.dioxide),data=d)+geom_boxplot()

summary(d$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
qplot(d$density,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$density,data=d,geom="freqpoly",binwidth=0.0001)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$density),data=d)+geom_boxplot()

summary(d$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040
qplot(d$pH,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$pH,data=d,geom="freqpoly",binwidth=0.01)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$pH),data=d)+geom_boxplot()

summary(d$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
qplot(d$sulphates,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$sulphates,data=d,geom="freqpoly",binwidth=0.01)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$sulphates),data=d)+geom_boxplot()

summary(d$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
qplot(d$alcohol,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$alcohol,data=d,geom="freqpoly",binwidth=0.1)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$alcohol),data=d)+geom_boxplot()

summary(d$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
qplot(d$AA_score,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$AA_score,data=d,geom="freqpoly",binwidth=0.1)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$AA_score),data=d)+geom_boxplot()

summary(d$AA_score)
##        V1         
##  Min.   :-6.8196  
##  1st Qu.:-1.6896  
##  Median :-0.2724  
##  Mean   : 0.0000  
##  3rd Qu.: 1.7281  
##  Max.   : 7.0841
qplot(d$fraction.sulfur.dioxide,data=d,geom="freqpoly")#,color=as.factor(quality))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(d$fraction.sulfur.dioxide,data=d,geom="freqpoly",binwidth=0.01)#,color=as.factor(quality))

ggplot(aes(x=1,y=d$fraction.sulfur.dioxide),data=d)+geom_boxplot()

summary(d$fraction.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02273 0.25930 0.37500 0.38230 0.48480 0.85710
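
The per-variable calls above repeat the same three steps (frequency polygon, boxplot, summary) for each column. A minimal sketch of how that could be wrapped in a helper is shown here. The helper name `plot_variable` and the simulated data frame are assumptions for illustration only; they are not part of the original analysis.

```r
library(ggplot2)

# Hypothetical stand-in for the wine data frame `d`; the values are simulated.
set.seed(1)
d <- data.frame(
  fixed.acidity = rnorm(200, mean = 8.3, sd = 1.7),
  alcohol       = rnorm(200, mean = 10.4, sd = 1.1)
)

# Build a frequency polygon, a boxplot, and a numeric summary for one column.
# `plot_variable` is an illustrative helper, not a ggplot2 function.
plot_variable <- function(data, var, bins = 30) {
  list(
    freqpoly = ggplot(data, aes(x = .data[[var]])) +
      geom_freqpoly(bins = bins) +
      xlab(var),
    boxplot = ggplot(data, aes(x = "", y = .data[[var]])) +
      geom_boxplot() +
      xlab(NULL) +
      ylab(var),
    summary = summary(data[[var]])
  )
}

# Apply the helper to every column and keep the results by column name.
plots <- lapply(names(d), function(v) plot_variable(d, v))
names(plots) <- names(d)

print(plots$alcohol$summary)
```

Referring to columns by name inside `aes()` instead of `d$column` keeps the axis labels readable and lets the same code work on any subset of the data.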

Univariate Analysis

What is the structure of your dataset?

Number of Instances: 4898.

Number of Attributes: 11

Attributes are the following physicochemical properties:

1 - fixed acidity (tartaric acid - g / dm^3)

2 - volatile acidity (acetic acid - g / dm^3)

3 - citric acid (g / dm^3)

4 - residual sugar (g / dm^3)

5 - chlorides (sodium chloride - g / dm^3)

6 - free sulfur dioxide (mg / dm^3)

7 - total sulfur dioxide (mg / dm^3)

8 - density (g / cm^3)

9 - pH

10 - sulphates (potassium sulphate - g / dm^3)

11 - alcohol (% by volume)

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

The univariate plots do not distinguish any features as being particularly interesting. However, intuitively I expect that alcohol percentage should play a significant role in quality ratings.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Based on the description of attributes provided in [1], ‘volatile acidity’ and ‘citric acid’ might also play an important role because too much volatile acidity leads to an unpleasant flavor, and citric acid can add freshness and flavor to wines.

Did you create any new variables from existing variables in the dataset?

Yes, I created two new features as follows:

1. fraction.sulfur.dioxide = free.sulfur.dioxide/total.sulfur.dioxide

2. AA_score = z(alcohol) + z(citric.acid) - z(volatile.acidity)

AA_score stands for Alcohol and Acidity score. It is an intuitive way to combine the alcohol, citric acid, and volatile acidity of a wine. z(x) denotes the z-score of x; volatile.acidity is subtracted because it contributes to unpleasant flavor.
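
The two derived features can be sketched in a few lines of R. The toy rows and the helper `z()` below are assumptions for illustration; the original analysis applies the same formulas to the full data frame `d`.

```r
# Hypothetical toy rows; the real analysis uses the full wine data frame `d`.
d <- data.frame(
  alcohol              = c(9.4, 9.8, 10.5, 12.0),
  citric.acid          = c(0.00, 0.04, 0.36, 0.56),
  volatile.acidity     = c(0.70, 0.88, 0.50, 0.28),
  free.sulfur.dioxide  = c(11, 25, 17, 15),
  total.sulfur.dioxide = c(34, 67, 102, 60)
)

# z-score: centre on the mean and divide by the sample standard deviation.
# `z` is an illustrative helper defined here, not a base R function.
z <- function(x) (x - mean(x)) / sd(x)

# Share of total sulfur dioxide that is present in free form.
d$fraction.sulfur.dioxide <- d$free.sulfur.dioxide / d$total.sulfur.dioxide

# Alcohol and Acidity score: volatile acidity enters with a negative sign.
d$AA_score <- z(d$alcohol) + z(d$citric.acid) - z(d$volatile.acidity)

print(d[, c("fraction.sulfur.dioxide", "AA_score")])
```

Because each z-score has mean zero, AA_score also has mean zero, which agrees with the summary of AA_score reported in the univariate section.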

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most distributions looked bell-shaped with a longer right tail. The most unusual distribution belonged to citric.acid. total.sulfur.dioxide and alcohol had longer tails than the others. The data was already tidy, so no adjustments were needed.

Bivariate Plots Section

## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:car':
## 
##     logit
## 
## The following object is masked from 'package:ggplot2':
## 
##     %+%

##                         fixed.acidity volatile.acidity citric.acid
## fixed.acidity              1.00000000     -0.256130895  0.67170343
## volatile.acidity          -0.25613089      1.000000000 -0.55249568
## citric.acid                0.67170343     -0.552495685  1.00000000
## residual.sugar             0.11477672      0.001917882  0.14357716
## chlorides                  0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide       -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide      -0.11318144      0.076470005  0.03553302
## density                    0.66804729      0.022026232  0.36494718
## pH                        -0.68297819      0.234937294 -0.54190414
## sulphates                  0.18300566     -0.260986685  0.31277004
## alcohol                   -0.06166827     -0.202288027  0.10990325
## quality                    0.12405165     -0.390557780  0.22637251
## fraction.sulfur.dioxide   -0.13081236     -0.072618561 -0.16693889
## AA_score                   0.39828994     -0.806903815  0.76442244
##                         residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity              0.114776724  0.093705186        -0.153794193
## volatile.acidity           0.001917882  0.061297772        -0.010503827
## citric.acid                0.143577162  0.203822914        -0.060978129
## residual.sugar             1.000000000  0.055609535         0.187048995
## chlorides                  0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide        0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide       0.203027882  0.047400468         0.667666450
## density                    0.355283371  0.200632327        -0.021945831
## pH                        -0.085652422 -0.265026131         0.070377499
## sulphates                  0.005527121  0.371260481         0.051657572
## alcohol                    0.042075437 -0.221140545        -0.069408354
## quality                    0.013731637 -0.128906560        -0.050656057
## fraction.sulfur.dioxide   -0.070626080 -0.105156413         0.327240869
## AA_score                   0.084486905 -0.036149794        -0.055125752
##                         total.sulfur.dioxide     density          pH
## fixed.acidity                    -0.11318144  0.66804729 -0.68297819
## volatile.acidity                  0.07647000  0.02202623  0.23493729
## citric.acid                       0.03553302  0.36494718 -0.54190414
## residual.sugar                    0.20302788  0.35528337 -0.08565242
## chlorides                         0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide               0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide              1.00000000  0.07126948 -0.06649456
## density                           0.07126948  1.00000000 -0.34169933
## pH                               -0.06649456 -0.34169933  1.00000000
## sulphates                         0.04294684  0.14850641 -0.19664760
## alcohol                          -0.20565394 -0.49617977  0.20563251
## quality                          -0.18510029 -0.17491923 -0.05773139
## fraction.sulfur.dioxide          -0.37143493 -0.26497991  0.18489507
## AA_score                         -0.11339013 -0.07047315 -0.26265953
##                            sulphates     alcohol     quality
## fixed.acidity            0.183005664 -0.06166827  0.12405165
## volatile.acidity        -0.260986685 -0.20228803 -0.39055778
## citric.acid              0.312770044  0.10990325  0.22637251
## residual.sugar           0.005527121  0.04207544  0.01373164
## chlorides                0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide      0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide     0.042946836 -0.20565394 -0.18510029
## density                  0.148506412 -0.49617977 -0.17491923
## pH                      -0.196647602  0.20563251 -0.05773139
## sulphates                1.000000000  0.09359475  0.25139708
## alcohol                  0.093594750  1.00000000  0.47616632
## quality                  0.251397079  0.47616632  1.00000000
## fraction.sulfur.dioxide -0.010459139  0.24627450  0.19411335
## AA_score                 0.306868847  0.60338613  0.50263963
##                         fraction.sulfur.dioxide    AA_score
## fixed.acidity                       -0.13081236  0.39828994
## volatile.acidity                    -0.07261856 -0.80690381
## citric.acid                         -0.16693889  0.76442244
## residual.sugar                      -0.07062608  0.08448690
## chlorides                           -0.10515641 -0.03614979
## free.sulfur.dioxide                  0.32724087 -0.05512575
## total.sulfur.dioxide                -0.37143493 -0.11339013
## density                             -0.26497991 -0.07047315
## pH                                   0.18489507 -0.26265953
## sulphates                           -0.01045914  0.30686885
## alcohol                              0.24627450  0.60338613
## quality                              0.19411335  0.50263963
## fraction.sulfur.dioxide              1.00000000  0.06987323
## AA_score                             0.06987323  1.00000000
##                                [,1]
## fixed.acidity            0.12405165
## volatile.acidity        -0.39055778
## citric.acid              0.22637251
## residual.sugar           0.01373164
## chlorides               -0.12890656
## free.sulfur.dioxide     -0.05065606
## total.sulfur.dioxide    -0.18510029
## density                 -0.17491923
## pH                      -0.05773139
## sulphates                0.25139708
## alcohol                  0.47616632
## quality                  1.00000000
## fraction.sulfur.dioxide  0.19411335
## AA_score                 0.50263963
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Among the original features, ‘quality’ correlated most strongly with alcohol (R=0.48) and volatile.acidity (R=-0.39), followed by sulphates (R=0.25) and citric.acid (R=0.23). Apart from a moderate negative correlation with density (R=-0.50), ‘alcohol’ was not strongly correlated with any other original feature. ‘volatile.acidity’ and ‘citric.acid’ had a relatively high negative correlation coefficient (R=-0.55).

‘quality’ and sulphates appear to have an interesting relationship: quality increases as sulphates increase up to about 0.9 and then decreases. There are 8 outlier wines with sulphates higher than 1.5 and moderate quality ranging from 4 to 8.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

fixed.acidity had high correlations with citric.acid (R=0.67), density (R=0.67), and pH (R=-0.68).

What was the strongest relationship you found?

Among the original features, the strongest relationship was between fixed.acidity and pH (R=-0.68), which is no surprise.
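
The ranking of features by their correlation with quality can be reproduced with a short sketch like the following. The data here are simulated and the coefficients used to generate them are arbitrary assumptions; only the `cor()` and ordering steps mirror what the analysis does.

```r
# Simulated stand-in for the wine data; generating coefficients are arbitrary.
set.seed(42)
n <- 300
alcohol          <- rnorm(n, mean = 10.4, sd = 1.1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
sulphates        <- rnorm(n, mean = 0.66, sd = 0.17)
quality <- round(5.6 + 0.4 * (alcohol - 10.4) -
                 1.5 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.6))
d <- data.frame(alcohol, volatile.acidity, sulphates, quality)

# Pearson correlation matrix of all columns.
cm <- cor(d)

# Take the 'quality' column, drop its self-correlation, and rank the
# remaining features by absolute correlation, strongest first.
q <- cm[, "quality"]
q <- q[names(q) != "quality"]
ranked <- q[order(abs(q), decreasing = TRUE)]

print(round(ranked, 2))
```

Ranking by absolute value matters because a strong negative correlation, such as the one for volatile.acidity, is as informative as a positive one of the same size.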

Multivariate Plots Section

## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.197e+01  2.119e+01   1.036   0.3002    
## fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357    
## volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
## citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150    
## residual.sugar        1.633e-02  1.500e-02   1.089   0.2765    
## chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
## free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *  
## total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
## density              -1.788e+01  2.163e+01  -0.827   0.4086    
## pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *  
## sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
## alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I was able to find three features that can be very good predictors of wine quality: Alcohol, Citric Acid, and Volatile Acidity. More specifically, quality increases with Alcohol and Citric Acid and decreases as Volatile Acidity increases.

Were there any interesting or surprising interactions between features?

Sulfur Dioxide had an interesting relationship with quality. From the log distributions stratified by quality, it is apparent that low Sulfur Dioxide is associated with both very low and very high quality wines, while medium quality wines tend to have medium Sulfur Dioxide concentrations.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear regression model using the ‘lm’ function. The model identifies ‘volatile.acidity’, ‘chlorides’, ‘total.sulfur.dioxide’, ‘sulphates’, and ‘alcohol’ as highly significant (p < 0.001), with ‘free.sulfur.dioxide’ and ‘pH’ significant at the 0.05 level. The adjusted R-squared was 0.3561. I tried to improve the adjusted R-squared by removing unimportant variables such as density or free.sulfur.dioxide. However, removing any of the variables resulted in a lower adjusted R-squared.

Surprisingly, citric.acid did not have a statistically significant coefficient (p = 0.215).
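
The comparison of adjusted R-squared with and without a variable can be sketched as follows. The data are simulated under assumed coefficients, so the numbers will not match the wine model; the point is only the mechanics of `lm()`, `update()` and `summary()`.

```r
# Simulated stand-in for the wine data; generating coefficients are arbitrary.
set.seed(7)
n <- 500
d <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  density          = rnorm(n, mean = 0.9967, sd = 0.002)
)
d$quality <- 5.6 + 0.3 * (d$alcohol - 10.4) -
  1.1 * (d$volatile.acidity - 0.53) + rnorm(n, sd = 0.65)

# Full model, then the same model with 'density' dropped.
full    <- lm(quality ~ alcohol + volatile.acidity + density, data = d)
reduced <- update(full, . ~ . - density)

# Adjusted R-squared penalises extra terms, so it can go up or down
# when a variable is removed; plain R-squared can only go down.
adj_r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(adj_r2)

# F-test for whether the dropped term explains additional variance.
print(anova(reduced, full))
```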


Final Plots and Summary

Plot One

Description One

This plot shows the distribution of alcohol percentage stratified by wine quality. The distributions overlap substantially; however, wines with a higher alcohol percentage are more likely to be of higher quality. A new variable, ‘rating’, is defined to make the plot easier to read: wines with quality 3-4 are rated ‘Low’, 5-6 ‘Medium’, and 7-8 ‘High’.

Although it is a very simple plot, I think it depicts very useful information for an average wine buyer. Basically, based on this plot, white wines with alcohol percentage greater than 12 have a much higher chance of being high quality wines.
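
One way the ‘rating’ variable could be derived from ‘quality’ is with `cut()`. This is a minimal sketch using a small example vector; the original code for this step is not shown in the report, so the exact call is an assumption.

```r
# Example quality scores covering the observed range 3 to 8.
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are right-closed by default: (2,4], (4,6], (6,8].
rating <- cut(quality,
              breaks = c(2, 4, 6, 8),
              labels = c("Low", "Medium", "High"),
              ordered_result = TRUE)

print(data.frame(quality, rating))
```

Using an ordered factor keeps ‘Low’, ‘Medium’ and ‘High’ in a sensible order in plot legends and facets.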

Plot Two

Description Two

This is a scatter plot of the intuitively defined variable ‘AA_score’ vs ‘sulphates’, colored by the newly defined ‘rating’ variable. It illustrates a fairly good separation of high-quality wines from the rest of the wines. Sulphates was selected based on the linear regression analysis because its coefficient was significantly greater than zero.

Plot Three

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Description Three

This plot shows the relationship between the quality estimated by the linear regression model and the actual quality. Jitter is added to the actual quality to reduce overplotting. The plot illustrates that the estimated (predicted) quality has a fairly good correlation with the actual quality.
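
A plot of this kind could be built along the following lines. The data and the two-predictor model are simulated assumptions, and the straight-line smoother (`method = "lm"`) is chosen here for simplicity, whereas the message above indicates the original plot used the default gam smoother.

```r
library(ggplot2)

# Simulated stand-in for the wine data; generating coefficients are arbitrary.
set.seed(3)
n <- 400
d <- data.frame(
  alcohol   = rnorm(n, mean = 10.4, sd = 1.1),
  sulphates = rnorm(n, mean = 0.66, sd = 0.17)
)
raw <- 5.6 + 0.45 * (d$alcohol - 10.4) + 1.0 * (d$sulphates - 0.66) +
  rnorm(n, sd = 0.6)
d$quality <- pmin(8, pmax(3, round(raw)))  # integer scores clipped to 3..8

# Fit a linear model and store its fitted values.
fit <- lm(quality ~ alcohol + sulphates, data = d)
d$predicted <- fitted(fit)

# Jitter only along the actual-quality axis to reduce overplotting
# without distorting the predicted values.
p <- ggplot(d, aes(x = quality, y = predicted)) +
  geom_jitter(width = 0.25, height = 0, alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Actual quality (jittered)", y = "Predicted quality")

# Correlation between actual and predicted quality.
r <- cor(d$quality, d$predicted)
print(r)
```

For a least-squares fit with an intercept, the square of this correlation equals the model's multiple R-squared, so the visual impression of the plot and the R-squared in the model summary describe the same quantity.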


Reflection

The white wine dataset contained 1599 wines with 11 measured physicochemical properties and a quality score. The data was tidy and did not need any cleaning or adjustments. The questions I was interested in were the following:

1. Which wine attributes play the most important role in the quality of white wine?

2. Is it possible to estimate (predict) the quality of white wine from its measured attributes using a model?

To answer the above questions, I started exploring the dataset by creating histograms of every single variable to get a feeling for how the variables are distributed and to identify possible outliers. Most distributions looked bell-shaped with a longer right tail. The most unusual distribution belonged to citric.acid. total.sulfur.dioxide and alcohol had longer tails than the others.

After that I investigated the relationship between ‘quality’ and each variable separately. Based on correlation analysis, quality had the highest linear correlation with alcohol and volatile.acidity, followed by sulphates and citric.acid. Scatter plots and multivariate analysis showed some interesting nonlinear relationships; for example, quality increased with sulphates up to 0.9 (g / dm^3) and then decreased for sulphates higher than 0.9 (g / dm^3). Another example was Sulfur Dioxide: low Sulfur Dioxide was associated with both very low and very high quality wines, while medium quality wines tended to have medium Sulfur Dioxide concentrations.

Finally, I was able to determine four attributes that contribute to white wine quality: Alcohol, Citric Acid, Sulphates and Volatile Acidity. A linear regression model was used to model the data. The fitted model had an adjusted R-squared value of 0.36 (p < 2.2e-16). The model identified ‘volatile.acidity’, ‘chlorides’, ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’, ‘pH’, ‘sulphates’, and ‘alcohol’ as having a significant relationship with quality at the 0.05 level. The linear regression results were consistent with my findings except for citric.acid, which did not have a statistically significant coefficient. This might be due to the moderate correlation between citric acid and volatile acidity and to non-linear relationships in the data. For this reason, I think a more sophisticated non-linear model, such as a Support Vector Machine (SVM) or Random Forest, may be more suitable for this data. Based on results reported in [1], the top 5 attributes predicting white wine quality are sulphates, alcohol, residual sugar, citric acid and total sulfur dioxide, which differ from my findings based on linear regression and visual exploration of the data.

References

[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.